When Was It Written? Automatically Determining Publication Dates
Internal identifier: 000403 (Main/Exploration)
Authors: Anne Garcia-Fernandez [France]; Anne-Laure Ligozat [France]; Marco Dinarelli [France]; Delphine Bernhard [France]
Source:
- Lecture Notes in Computer Science [0302-9743]; 2011.
Abstract
Automatically determining the publication date of a document is a complex task, since a document may contain only a few intra-textual hints about its publication date. Yet, it has many important applications. Indeed, the amount of digitized historical documents is constantly increasing, but their publication dates are not always properly identified via OCR acquisition. Accurate knowledge about publication dates is crucial for many applications, e.g. studying the evolution of document topics over a certain period of time. In this article, we present a method for automatically determining the publication dates of documents, which was evaluated on a French newspaper corpus in the context of the DEFT 2011 evaluation campaign. Our system is based on a combination of different individual systems, relying both on supervised and unsupervised learning, and uses several external resources, e.g. Wikipedia, Google Books Ngrams, and etymological background knowledge about the French language. Our system detects the correct year of publication in 10% of the cases for 300-word excerpts and in 14% of the cases for 500-word excerpts, which is very promising given the complexity of the task.
DOI: 10.1007/978-3-642-24583-1_22
Links to previous steps (curation, corpus, ...)
- to stream Istex, to step Corpus: 000105
- to stream Istex, to step Curation: 000103
- to stream Istex, to step Checkpoint: 000059
- to stream Main, to step Merge: 000408
- to stream Main, to step Curation: 000403
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">When Was It Written? Automatically Determining Publication Dates</title>
<author><name sortKey="Garcia Fernandez, Anne" sort="Garcia Fernandez, Anne" uniqKey="Garcia Fernandez A" first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name>
</author>
<author><name sortKey="Ligozat, Anne Laure" sort="Ligozat, Anne Laure" uniqKey="Ligozat A" first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name>
</author>
<author><name sortKey="Dinarelli, Marco" sort="Dinarelli, Marco" uniqKey="Dinarelli M" first="Marco" last="Dinarelli">Marco Dinarelli</name>
</author>
<author><name sortKey="Bernhard, Delphine" sort="Bernhard, Delphine" uniqKey="Bernhard D" first="Delphine" last="Bernhard">Delphine Bernhard</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:1A59EF625F8BD73E9C9DA7E5A6068BBA5B49114B</idno>
<date when="2011" year="2011">2011</date>
<idno type="doi">10.1007/978-3-642-24583-1_22</idno>
<idno type="url">https://api.istex.fr/document/1A59EF625F8BD73E9C9DA7E5A6068BBA5B49114B/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000105</idno>
<idno type="wicri:Area/Istex/Curation">000103</idno>
<idno type="wicri:Area/Istex/Checkpoint">000059</idno>
<idno type="wicri:doubleKey">0302-9743:2011:Garcia Fernandez A:when:was:it</idno>
<idno type="wicri:Area/Main/Merge">000408</idno>
<idno type="wicri:Area/Main/Curation">000403</idno>
<idno type="wicri:Area/Main/Exploration">000403</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">When Was It Written? Automatically Determining Publication Dates</title>
<author><name sortKey="Garcia Fernandez, Anne" sort="Garcia Fernandez, Anne" uniqKey="Garcia Fernandez A" first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name>
<affiliation wicri:level="1"><country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName><region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Ligozat, Anne Laure" sort="Ligozat, Anne Laure" uniqKey="Ligozat A" first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name>
<affiliation wicri:level="1"><country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName><region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country xml:lang="fr">France</country>
<wicri:regionArea>ENSIIE, Evry</wicri:regionArea>
<placeName><region type="région">Île-de-France</region>
<settlement type="city">Évry (Essonne)</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Dinarelli, Marco" sort="Dinarelli, Marco" uniqKey="Dinarelli M" first="Marco" last="Dinarelli">Marco Dinarelli</name>
<affiliation wicri:level="1"><country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName><region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Bernhard, Delphine" sort="Bernhard, Delphine" uniqKey="Bernhard D" first="Delphine" last="Bernhard">Delphine Bernhard</name>
<affiliation wicri:level="1"><country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName><region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2011</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">1A59EF625F8BD73E9C9DA7E5A6068BBA5B49114B</idno>
<idno type="DOI">10.1007/978-3-642-24583-1_22</idno>
<idno type="ChapterID">22</idno>
<idno type="ChapterID">Chap22</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Automatically determining the publication date of a document is a complex task, since a document may contain only few intra-textual hints about its publication date. Yet, it has many important applications. Indeed, the amount of digitized historical documents is constantly increasing, but their publication dates are not always properly identified via OCR acquisition. Accurate knowledge about publication dates is crucial for many applications, e.g. studying the evolution of documents topics over a certain period of time. In this article, we present a method for automatically determining the publication dates of documents, which was evaluated on a French newspaper corpus in the context of the DEFT 2011 evaluation campaign. Our system is based on a combination of different individual systems, relying both on supervised and unsupervised learning, and uses several external resources, e.g. Wikipedia, Google Books Ngrams, and etymological background knowledge about the French language. Our system detects the correct year of publication in 10% of the cases for 300-word excerpts and in 14% of the cases for 500-word excerpts, which is very promising given the complexity of the task.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Île-de-France</li>
</region>
<settlement><li>Orsay</li>
<li>Évry (Essonne)</li>
</settlement>
</list>
<tree><country name="France"><region name="Île-de-France"><name sortKey="Garcia Fernandez, Anne" sort="Garcia Fernandez, Anne" uniqKey="Garcia Fernandez A" first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name>
</region>
<name sortKey="Bernhard, Delphine" sort="Bernhard, Delphine" uniqKey="Bernhard D" first="Delphine" last="Bernhard">Delphine Bernhard</name>
<name sortKey="Bernhard, Delphine" sort="Bernhard, Delphine" uniqKey="Bernhard D" first="Delphine" last="Bernhard">Delphine Bernhard</name>
<name sortKey="Dinarelli, Marco" sort="Dinarelli, Marco" uniqKey="Dinarelli M" first="Marco" last="Dinarelli">Marco Dinarelli</name>
<name sortKey="Dinarelli, Marco" sort="Dinarelli, Marco" uniqKey="Dinarelli M" first="Marco" last="Dinarelli">Marco Dinarelli</name>
<name sortKey="Garcia Fernandez, Anne" sort="Garcia Fernandez, Anne" uniqKey="Garcia Fernandez A" first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name>
<name sortKey="Ligozat, Anne Laure" sort="Ligozat, Anne Laure" uniqKey="Ligozat A" first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name>
<name sortKey="Ligozat, Anne Laure" sort="Ligozat, Anne Laure" uniqKey="Ligozat A" first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name>
<name sortKey="Ligozat, Anne Laure" sort="Ligozat, Anne Laure" uniqKey="Ligozat A" first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000403 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000403 | SxmlIndent | more
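Outside a Dilib installation, the record can probably also be read with a generic XML parser. The following is a minimal Python sketch, not part of Dilib: the function name parse_record, the shortened RECORD sample, and the placeholder namespace URI are all assumptions made for illustration. The record uses the wicri: prefix without declaring it, so the sketch declares it on the root element before parsing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened copy of the record shown on this page;
# the real record contains more authors, affiliations and identifiers.
RECORD = """<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt>
<title xml:lang="en">When Was It Written? Automatically Determining Publication Dates</title>
<author><name first="Anne" last="Garcia-Fernandez">Anne Garcia-Fernandez</name></author>
<author><name first="Anne-Laure" last="Ligozat">Anne-Laure Ligozat</name></author>
</titleStmt>
<publicationStmt><date when="2011" year="2011">2011</date>
<idno type="doi">10.1007/978-3-642-24583-1_22</idno>
</publicationStmt></fileDesc></teiHeader></TEI></record>"""

# Placeholder URI (assumption): the page does not state the real wicri namespace.
WICRI_NS = "https://example.org/wicri"


def parse_record(text):
    """Return title, authors, year and DOI from a Wicri-style record string."""
    # Declare the otherwise unbound wicri: prefix so the parser accepts the input.
    patched = text.replace("<record>", '<record xmlns:wicri="%s">' % WICRI_NS, 1)
    root = ET.fromstring(patched)
    return {
        "title": root.findtext(".//titleStmt/title"),
        "authors": [name.text for name in root.findall(".//titleStmt/author/name")],
        "year": root.find(".//publicationStmt/date").get("when"),
        "doi": root.findtext(".//publicationStmt/idno[@type='doi']"),
    }


if __name__ == "__main__":
    for key, value in parse_record(RECORD).items():
        print(key, "=", value)
```

In practice the input string would be the output of one of the HfdSelect commands above rather than an inline sample.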
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:1A59EF625F8BD73E9C9DA7E5A6068BBA5B49114B |texte= When Was It Written? Automatically Determining Publication Dates }}
This area was generated with Dilib version V0.6.32.